COVID-19 impact on mental health

Background: The coronavirus disease 2019 (COVID-19) pandemic has had a significant influence on public mental health. Current efforts focus on alleviating the impacts of the disease on public health and the economy, while the psychological effects of COVID-19 have been relatively ignored. In this research, we explore a quantitative characterization of the pandemic's impact on public mental health by studying an online survey data set from the United States.
Methods: The analyses are based on a large-scale online mental health-related survey in the United States, conducted over 12 consecutive weeks from April 23, 2020 to July 21, 2020. We examine the risk factors that have a significant impact on mental health as well as their estimated effects over time. We employ the multiple imputation by chained equations (MICE) method to deal with missing values and use logistic regression with the least absolute shrinkage and selection operator (Lasso) to identify risk factors for mental health.
Results: Our analysis shows that risk predictors for an individual to experience mental health issues include the pandemic situation of the State where the individual resides, age, gender, race, marital status, health conditions, the number of household members, employment status, the level of confidence in future food affordability, availability of health insurance, mortgage status, and whether kids in the household are enrolled in school. The effects of most of the predictors seem to change over time, though the degree varies across risk factors. The effects of risk factors such as State and gender show noticeable change over time, whereas the effect of age appears unchanged over time.
Conclusions: The analysis results provide evidence-based findings that identify the groups who are psychologically vulnerable to the COVID-19 pandemic. This study provides helpful evidence for assisting healthcare providers and policymakers in taking steps to mitigate the pandemic effects on public mental health, especially in boosting public health care, improving public confidence in future food conditions, and creating more job opportunities.
Trial registration: This article does not report the results of a health care intervention on human participants.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12874-021-01411-w.


Background
Since the outbreak of the COVID-19 pandemic, people's lifestyles have changed significantly. However, sufficient resources have not been available to attenuate the pandemic effects on mental health and well-being [1]. Various studies have been conducted to investigate how the COVID-19 pandemic may affect people psychologically. For example, Cao et al. [2] conducted a survey on college students in China and showed that more than 24% of the students were experiencing anxiety. Spoorthy et al. [3] investigated the mental health problems faced by healthcare workers during the COVID-19 pandemic.
While those studies provided descriptive results by summarizing the information obtained from the questionnaire, it is unclear how the impact of COVID-19 changes over time; what factors are relevant to describe the impact of the pandemic; and how the severity of the mental health issues is quantitatively associated with the risk factors. In this paper, we examine these questions and aim to provide some quantitative insights. Our explorations are carried out using a large scale online public survey study conducted by the U.S. Census Bureau [4]. The data include twelve data sets each collected in a 1-week window over 12 consecutive weeks from April 23, 2020 to July 21, 2020. Different data sets contain the measurements from different participants on the same questions. Among the 12 data sets, the smallest one contains 41,996 subjects and the largest one has 132,961 participants. We treat the survey in each week as an independent study. We are interested in assessing how the effects of the associated risk factors may change over time by applying the same method to each of the 12 data sets separately.
The survey includes multiple questions perceived to be relevant to describing the impact of the pandemic on the public. To quantitatively identify the risk factors associated with the pandemic's impact on mental health, we use the penalized logistic regression method with the least absolute shrinkage and selection operator (Lasso) penalty [5]. However, a direct application of the Lasso method is not possible due to the presence of missing observations. To handle missing values, we employ the multiple imputation by chained equations (MICE) method (e.g., [6,7]). Further, survey data commonly involve measurement error due to recall bias, the inability to provide precise descriptions for some answers, and reporting errors. It is imperative to address this issue when pre-processing the data. To this end, we combine the levels of highly related categorical variables to mitigate the measurement error effects.

Original survey data
The data used in this project are from phase 1 of the Household Pulse Survey conducted by the U.S. Census Bureau [4] from April 23, 2020 to July 21, 2020 for 12 consecutive weeks, giving rise to 12 data sets, one for each week. The survey aims to study the pandemic impacts on households across the United States from social and economic perspectives. The survey contains 50 questions covering education, employment, food sufficiency, health, housing, social security benefits, household spending, stimulus payments, and transportation. The participants of the survey come from all 50 states plus Washington, D.C., and are aged from 18 to 88. The gender ratio (the ratio of males to females) remains fairly stable, ranging between 0.6 and 0.7 over the 12 weeks. Figure S1 in the Supplementary Material shows the curves of the number of cumulative confirmed cases for all the states, which are grouped into four categories of pandemic severity derived from the data from the Centers for Disease Control and Prevention [8]. Table 1 lists the state members for each category, together with the total number of participants over the 12 weeks and the corresponding percentage for each category. The majority (72.5%) of the participants of the survey come from the states with a mild pandemic, and the smallest proportion (2.3%) of subjects are from the states with a serious pandemic.

Pre-processing the data to reduce errors
Among the initial 50 questions, nine questions, such as "Where did you get free groceries or free meals" and "How often is the Internet available to children for educational purposes", are excluded because they are not perceived as sustainable factors affecting mental health. Measurement error is typically involved in survey data. Prior to a formal analysis of the data, we implement a pre-processing procedure to mitigate the measurement error effects by combining questions to create new variables, or collapsing levels of variables to form binary variables. Information on mental health is collected via four questions concerning anxiety, worry, loss of interest, and feeling down. Each question is a four-level Likert item [9] with values 1, 2, 3 and 4, showing the degree of each aspect for the 7 days prior to the survey time. In contrast to Twenge and Joiner [10], who combined the measurements of the first two questions (anxiety and worry) to indicate the anxiety level and the last two questions (loss of interest and feeling down) to show the depression level, we define a single binary response to reflect the mental health status of an individual by combining measurements of the four variables. The response variable takes value 1 if the average of the scores of the four variables is greater than 2.5, and 0 otherwise, where the threshold 2.5 is the median of the possible values 1, 2, 3 and 4 for each question. This binary response gives a synthetic way to indicate the mental health status, which is easier than examining measurements of multiple variables.
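As an illustration of the response definition above, the following minimal Python sketch computes the binary indicator from four Likert scores. The function name, the argument names, and the handling of missing items (returning `None` if any item is missing) are assumptions made for this example and are not taken from the survey documentation or the authors' code.

```python
from statistics import mean


def mental_health_response(anxiety, worry, interest, down, threshold=2.5):
    """Return 1 if the mean of the four Likert scores (each 1-4) exceeds
    the threshold, 0 otherwise.

    Assumption for illustration: if any item is missing (None), the
    response is treated as missing and None is returned.
    """
    scores = [anxiety, worry, interest, down]
    if any(s is None for s in scores):
        return None
    return 1 if mean(scores) > threshold else 0


# Hypothetical respondents (not real survey records)
print(mental_health_response(3, 3, 2, 3))  # mean 2.75, above the threshold
print(mental_health_response(2, 3, 2, 3))  # mean 2.5, not above the threshold
```

Because the comparison is strict, a respondent whose average equals 2.5 exactly is coded as 0.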
Two variables describe the loss of work: Wrkloss indicates whether an individual in the household experiences a loss of employment income since March 13, 2020; Expctloss indicates if the individual expects a member in the household to experience a loss of employment income in the next 4 weeks because of the COVID-19 pandemic. These two variables are combined to form a single indicator which is denoted Wrkloss, with value 1 if at least one of these two events happens. Two ordinal variables, Prifoodsuf and Curfoodsuf, are used to describe the food sufficiency status before the pandemic and at present, respectively. The Foodcon.change variable is constructed by comparing the current and the previous food sufficiency status to form a binary variable, taking 1 if the current food sufficiency status is no worse than the food status before the pandemic, and 0 otherwise. Variable Med.delay.notget is combined from two indicator variables Delay (indicating if medical care is delayed) and Notget (indicating if the medical care is not received), taking value 1 if either medical care is delayed or no medical care is received, and 0 otherwise. Predictor Mort.prob is combined from one binary variable and an ordinal variable, taking 1 if a participant does not pay last month's rent or mortgage or does not have enough confidence in paying the next rent or mortgage payment on time, and 0 otherwise. In addition, three ordinal variables, Emppay, Healins, and Schoolenroll, are modified by collapsing their levels to form binary categories. Emppay has value 1 if he/she gets paid for the time he/she is not working, and 0 otherwise. Healins has value 1 if the individual is currently covered by the health insurance, and 0 otherwise. Schoolenroll has value 1 if there is a child in the household enrolled in school, and 0 otherwise. Except for the variables discussed above, the remaining variables are kept as in the original form.
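The combination rules described above can be sketched as follows. This is a hedged illustration only: the record layout (a dictionary keyed by variable name), the 0/1 coding of the indicators, and the assumption that larger values of Prifoodsuf and Curfoodsuf mean worse food sufficiency are choices made for this example; missing values are not handled here.

```python
def either(a, b):
    """Return 1 if at least one of two 0/1 indicators equals 1, else 0."""
    return 1 if (a == 1 or b == 1) else 0


def food_condition_change(prior, current):
    """Return 1 if the current food sufficiency is no worse than before
    the pandemic, 0 otherwise.

    Assumption: both are ordinal codes where a LARGER value means WORSE
    food sufficiency.
    """
    return 1 if current <= prior else 0


def derive_indicators(record):
    """Build the combined binary variables from one raw record (a dict)."""
    return {
        "Wrkloss": either(record["Wrkloss"], record["Expctloss"]),
        "Med.delay.notget": either(record["Delay"], record["Notget"]),
        "Foodcon.change": food_condition_change(
            record["Prifoodsuf"], record["Curfoodsuf"]
        ),
    }


# Hypothetical raw record (not taken from the survey)
raw = {"Wrkloss": 0, "Expctloss": 1, "Delay": 0, "Notget": 0,
       "Prifoodsuf": 1, "Curfoodsuf": 2}
print(derive_indicators(raw))
```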
The final data include the binary response (indicating the mental health status of an individual) and 25 predictors measuring various aspects of individuals. To be specific, nine predictors show basic information: State, Age, Male, Rhispanic, Race, Educ, MS (marital status), Numper (the number of people in the household), and Numkid (the number of people under 18 in the household); five variables concern income and employment: Income, Wrkloss, Anywork, Kindwork, and Emppay; five variables are related to food: Foodcon.change, Freefood, Tspndfood, Tspndprpd, and Foodconf; three variables pertain to health and insurance: Hlthstatus, Healins, and Med.delay.notget; one variable, Mort.prob, is for mortgage and housing; and two variables, Schoolenroll and Ttch_hrs, reflect child education. The variable dictionary for the pre-processed data is shown in Table 2.

Missing observations
In the data sets, 17 covariates together with the response variable have missing observations. To provide a quick and intuitive sense of the missingness proportions for different variables over the 12 data sets, we combine those data sets by individual variable to form a single pooled data set. Then we calculate the missingness proportion for each variable by dividing the number of missing observations in the variable by the total number of subjects in the pooled data set. We display in Fig. 1 the missingness rates for those 17 risk factors and the response variable (mental health status) for the pooled data. The risk factors having the three highest missingness rates are the variables Ttch_hrs, Schoolenroll and Emppay, and the corresponding missingness rates are 76.7%, 66.9% and 60.5%, respectively. Five variables incur higher than 30% missingness proportions, and the missingness proportion for 12 risk factors is larger than 5%. The missingness proportion for the response variable is about 8.6%.
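The pooled missingness proportion described above amounts to a simple count. The sketch below assumes each weekly data set is a list of dictionaries with `None` marking a missing observation; this representation and the function name are illustrative assumptions, not the authors' implementation.

```python
def missingness_rates(weekly_data):
    """Pool the weekly data sets and return, for each variable, the number
    of missing observations divided by the total number of pooled subjects.

    weekly_data: list of data sets, each a list of dict records in which
    None (or an absent key) marks a missing observation.
    """
    pooled = [row for week in weekly_data for row in week]
    n = len(pooled)
    variables = sorted({name for row in pooled for name in row})
    return {
        name: sum(row.get(name) is None for row in pooled) / n
        for name in variables
    }


# Two tiny hypothetical "weeks" (not real survey data)
week1 = [{"Emppay": None, "Age": 30}, {"Emppay": 1, "Age": 41}]
week2 = [{"Emppay": None, "Age": None}, {"Emppay": None, "Age": 52}]
rates = missingness_rates([week1, week2])
print(rates)
```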
Missing values present a challenge for data analysis and model fitting. One may perform the so-called complete data analysis by deleting those subjects with missing observations, or the so-called available data analysis by using all available data, and then repeat a standard analysis procedure. Such analyses are easy to implement; however, biased results are expected if the missing completely at random (MCAR) mechanism does not hold. Here we consider a broader setting where missing data do not necessarily follow the MCAR mechanism but follow the missing at random (MAR) mechanism. We employ the MICE method, which is developed under the MAR mechanism and applies to various types of variables subject to missingness, such as continuous, binary, nominal, and ordinal variables. A detailed discussion of this method was provided by van Buuren et al. [11].
Here we employ the MICE method to accommodate missing observations that are present in both the predictors and the response. Following the suggestion of Allison [12], we choose to do five imputations for the data in each week by employing the same algorithm with different random seeds. The implementation is conducted in R (version 3.6.1) with the R package: Multivariate Imputation by Chained Equation (mice). The details on the R code are presented in the code availability in the Declarations section.
To empirically assess the imputation results, we take the data in week 6 as an example and compare the five imputed data sets to the original data by displaying their distributions, obtained using the R function density, for the continuous variables in Figure S2 in the Supplementary Material. It is seen that the distributions of the five imputed data sets for the three continuous variables, Tspndfood, Tspndprpd, and Ttch_hrs, are fairly similar to that of the original data. Further, in Tables S1, S2, and S3 in the Supplementary Material, we report the proportions of different levels for the categorical variables for both the imputed and original data, showing the similarity in the distributions of the imputed data and of the original data.

Model building and inference
We intend to employ logistic regression with the Lasso penalty to analyze the data that contain a binary response and potentially related predictors or covariates. First, we introduce the basic notation and discuss the method in general terms. For $i = 1, \ldots, n$, let $Y_i$ represent the binary response with value 1 indicating that the mental health problem occurs for subject $i$ and 0 otherwise. Let $X_{ij}$ denote the $j$th covariate for subject $i$, where $j = 1, \ldots, p$, and $p$ is the number of predictors. Write $X_i = (X_{i1}, X_{i2}, \ldots, X_{ip})^T$ and let $\pi_i = P(Y_i = 1 \mid X_i)$.
Consider the logistic regression model

$$\operatorname{logit}(\pi_i) = \log\frac{\pi_i}{1-\pi_i} = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}, \tag{1}$$

where $\beta_0$ is the intercept and $\beta = (\beta_1, \ldots, \beta_p)^T$ denotes the vector of regression parameters. Consequently, the log-likelihood function is given by

$$\ell(\beta_0, \beta) = \sum_{i=1}^{n} \left\{ Y_i \log \pi_i + (1 - Y_i) \log (1 - \pi_i) \right\}. \tag{2}$$

To select the predictors associated with the dichotomous response, we employ the Lasso method. The Lasso estimates are the values that maximize the penalized log-likelihood function obtained by adding an $L_1$ penalty to the expression (2):

$$\ell(\beta_0, \beta) - \lambda \sum_{j=1}^{p} |\beta_j|, \tag{3}$$

where $\lambda$ is the tuning parameter. Ten-fold cross-validation is employed to obtain a proper value for the tuning parameter, and the one-standard-error rule [13] is applied to pick the most parsimonious model within one standard error of the minimum cross-validation misclassification rate (e.g., [14]).

Fig. 1 The missingness rates for the 17 risk factors and the response of the pooled data
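To make the two ingredients of this section concrete, the sketch below evaluates an L1-penalized logistic log-likelihood and applies the one-standard-error rule to a grid of tuning parameters. It is a minimal illustration under stated assumptions: the intercept is left unpenalized, the log-likelihood is not rescaled by the sample size, and the cross-validation errors and standard errors are hypothetical numbers supplied by hand rather than computed from the survey data.

```python
import math


def penalized_loglik(beta0, beta, X, y, lam):
    """Logistic log-likelihood minus lam times the L1 norm of beta.

    Assumption: the intercept beta0 is not penalized.
    Uses the identity y*log(pi) + (1-y)*log(1-pi) = y*eta - log(1+exp(eta)).
    """
    loglik = 0.0
    for xi, yi in zip(X, y):
        eta = beta0 + sum(b * x for b, x in zip(beta, xi))
        loglik += yi * eta - math.log1p(math.exp(eta))
    return loglik - lam * sum(abs(b) for b in beta)


def one_standard_error_lambda(lambdas, cv_errors, cv_ses):
    """Return the largest lambda whose cross-validation error is within
    one standard error of the minimum error (larger lambda gives a
    sparser, more parsimonious Lasso fit)."""
    best = min(range(len(lambdas)), key=lambda i: cv_errors[i])
    cutoff = cv_errors[best] + cv_ses[best]
    eligible = [lam for lam, err in zip(lambdas, cv_errors) if err <= cutoff]
    return max(eligible)


# Hypothetical cross-validation summary (misclassification rates)
lambdas = [0.001, 0.01, 0.05, 0.1]
cv_errors = [0.252, 0.245, 0.249, 0.270]
cv_ses = [0.004, 0.005, 0.005, 0.006]
chosen = one_standard_error_lambda(lambdas, cv_errors, cv_ses)
print(chosen)
```

In this toy grid the minimum error occurs at 0.01, but 0.05 is also within one standard error of it and is therefore preferred as the more parsimonious choice.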

Model fitting and variable selection
The Lasso logistic regression is applied to each of the five imputed data sets for each week. The predictors corresponding to the nonzero coefficient estimates are considered the risk factors selected, which may be different across five imputed data sets for each of the 12 weeks.
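Since the selected risk factors may differ across the five imputed data sets, it can help to look at both the union and the intersection of the selected sets for a given week. The sketch below shows this with plain Python sets; the variable names in the example are hypothetical selections, not results reported in the paper.

```python
def candidate_predictor_sets(selected_per_imputation):
    """Given the predictors selected on each imputed data set, return
    (union, intersection): the predictors selected at least once and the
    predictors selected every time."""
    sets = [set(s) for s in selected_per_imputation]
    return set.union(*sets), set.intersection(*sets)


# Hypothetical selections from three imputed data sets
selections = [
    ["Age", "Male", "Wrkloss", "State"],
    ["Age", "Male", "Wrkloss"],
    ["Age", "Male", "Wrkloss", "Race"],
]
union_set, common_set = candidate_predictor_sets(selections)
print(sorted(union_set))
print(sorted(common_set))
```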
To explore the full spectrum, we start with two extreme models: the full model, which includes the union of all the risk factors selected by the Lasso logistic regression, and the reduced model, which includes only the common factors selected for all five imputed data sets in any week. We expect the predictors in the final model to form a set in between the sets of predictors for the reduced model and the full model. The problem is then how to find the final model using the reduced and full models. To this end, we carry out the following four steps.
In Step 1, we fit logistic regression with predictors in the full model and in the reduced model, respectively, to each of the five surrogate data sets for each of the 12 weeks.
In Step 2, the estimates and standard errors of the model coefficients for a given week are obtained using the algorithm described by Allison [12]. To be specific, let $M = 5$ be the number of surrogate data sets for the original incomplete data. Let $\beta_j$ be the $j$th component of the model parameter vector $\beta$. For $k = 1, \ldots, M$, let $\hat\beta_j^{(k)}$ denote the estimate of the model parameter $\beta_j$ obtained from fitting the $k$th surrogate data set in a week and let $S_j^{(k)}$ be its associated standard error. Then the point estimate of $\beta_j$ is given by the average of those estimates of $\beta_j$ derived from the $M$ imputed data sets:

$$\hat\beta_j = \frac{1}{M} \sum_{k=1}^{M} \hat\beta_j^{(k)}. \tag{4}$$

To determine the variability associated with $\hat\beta_j$, one needs to incorporate both the within-imputation variance, denoted $V_w$, and the between-imputation variance, denoted $V_b$. According to Rubin's rule [6], the total variance associated with the multiple imputation estimate $\hat\beta_j$ is given by $\mathrm{Var}(\hat\beta_j) = V_w + \left(1 + \frac{1}{M}\right) V_b$, where $V_w = \frac{1}{M} \sum_{k=1}^{M} \{S_j^{(k)}\}^2$, $V_b = \frac{1}{M-1} \sum_{k=1}^{M} \{\hat\beta_j^{(k)} - \hat\beta_j\}^2$, and $\frac{1}{M}$ is the between-imputation correction factor. As a result, the standard error associated with $\hat\beta_j$ is given by $\mathrm{se}(\hat\beta_j) = \sqrt{\mathrm{Var}(\hat\beta_j)}$, i.e.,

$$\mathrm{se}(\hat\beta_j) = \sqrt{\frac{1}{M} \sum_{k=1}^{M} \{S_j^{(k)}\}^2 + \left(1 + \frac{1}{M}\right) \frac{1}{M-1} \sum_{k=1}^{M} \{\hat\beta_j^{(k)} - \hat\beta_j\}^2}. \tag{5}$$

We report in Tables S4 and S5 in the Supplementary Material the estimated results of the covariate effects obtained, respectively, from the full and reduced models for the data in 12 weeks, where the covariates marked with an asterisk are statistically significant with p-values smaller than 0.05 for more than 6 weeks. It is found that in addition to those covariates included in the reduced model, fitting the full model also shows that five additional covariates, State, Rhispanic, Race, Numper, and Schoolenroll, are statistically significant for more than 6 weeks' data. Table S5 shows that almost all the covariates in the reduced model are statistically significant, with all the p-values derived from the data in 12 weeks smaller than 0.05.
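The pooling in Step 2 follows the standard form of Rubin's rules: average the per-imputation estimates, and combine the average squared standard error with the between-imputation sample variance inflated by 1 + 1/M. A minimal Python sketch is given below; the function name and the input numbers are hypothetical and serve only to illustrate the arithmetic.

```python
import math
from statistics import mean, variance


def pool_estimates(estimates, std_errors):
    """Pool one coefficient across M imputed data sets with Rubin's rules.

    estimates:  the M per-imputation coefficient estimates
    std_errors: the M associated standard errors
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])  # within-imputation variance
    between = variance(estimates)                  # sample variance, M - 1 denominator
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical estimates from M = 5 imputed data sets
est, se = pool_estimates([0.30, 0.34, 0.32, 0.36, 0.33], [0.05] * 5)
print(est, se)
```

With these inputs the within-imputation variance is 0.0025 and the between-imputation variance is 0.0005, so the total variance is 0.0025 + 1.2 × 0.0005 = 0.0031.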
Consequently, in Step 3, we take the 11 significant risk factors from the reduced model, and the 5 additional partially significant covariates suggested by fitting the full model, State, Rhispanic, Race, Numper, and Schoolenroll, to form the list of risk factors for the final model. In Step 4, we construct the final model using the model form (1) to include the variables selected in Step 3 as predictors, where dummy variables are used to express the categorical variables with more than two levels, State, Race, MS, Foodconf, and Hlthstatus, yielding 28 variables in the model. The final model is then given by

$$\operatorname{logit}(\pi) = \beta_0 + \beta_1 X_1 + \cdots + \beta_{28} X_{28}, \tag{6}$$

where $X_1, \ldots, X_{28}$ denote the 28 variables, $\beta_j$ is the regression coefficient for $j = 0, 1, \ldots, 28$, and the subscript $i$ is suppressed in $\pi$ and the covariates for ease of exposition.
Then, we fit the final logistic model (6) to each of the imputed data sets for each of the 12 weeks; in the same manner as indicated by (4) and (5), we obtain the point estimates of the model parameters and the associated standard errors. For a visual display, we plot in Fig. 2 the estimates of the coefficients for all the factors in the final model for 12 weeks; to show the estimates precisely, we report in Table 3 the point estimates for the covariate effects obtained from the final model, where we further calculate the average of the 12 estimates for each covariate and report the results in the last column. The associated standard errors and the p-values are deferred to Table S6 in the Supplementary Material. Figure 2 shows that the absolute values of coefficient estimates for some levels of variables Foodconf and Hlthstatus are greater than 1 (in Fig. 2K and L). The coefficient estimates of Med.delay.notget over 12 weeks are between 0.5 and 1 (in Fig. 2N). Other variables have coefficient estimates between -0.5 and 0.5.

Results
To have an overall sense of the estimates of the predictor effects in the final model, we now utilize the averages reported in the last column of Table 3 to estimate the relative change in the odds of having mental issues with one unit increase in a predictor from its baseline while keeping other predictors unchanged, yet leaving the associated variability uncharacterized. Let $\bar\beta_j$ represent the average of the estimates of the covariate effect $\beta_j$ over the 12 weeks for $j = 1, \ldots, 28$, which is a sensible estimate of $\beta_j$, because the arithmetic average preserves consistency if all the estimators obtained for the 12 weeks are consistent for $\beta_j$. Using $\bar\beta_j$ is advantageous in offering a single estimate of $\beta_j$ with generally smaller expected variability than the estimates obtained from each of the 12 weeks. If $\bar\beta_j$ is negative, then $1 - \exp(\bar\beta_j)$ shows an estimate of the decrease in the odds of having mental issues relative to the baseline; if $\bar\beta_j$ is positive, then $\exp(\bar\beta_j) - 1$ suggests an estimate of the increase in the odds of having mental issues relative to the baseline.
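The conversion from an averaged coefficient to a relative change in the odds is a one-line calculation. The sketch below uses a signed convention (negative values mean a decrease, positive values an increase), which is a presentational choice for this example; the three input values are averages quoted in this section.

```python
import math


def relative_odds_change(beta_bar):
    """Relative change in the odds for a one-unit increase in a predictor,
    given its (averaged) log-odds coefficient. Negative values indicate a
    decrease in the odds, positive values an increase."""
    return math.exp(beta_bar) - 1


# Averages quoted in the text: Male (-0.228), Wrkloss (0.352), Foodconf4 (-1.348)
for name, beta_bar in [("Male", -0.228), ("Wrkloss", 0.352), ("Foodconf4", -1.348)]:
    print(f"{name}: {100 * relative_odds_change(beta_bar):+.2f}% change in odds")
```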
To be specific, for the variable State with large daily increases of cases as the baseline, people from mild pandemic States exhibit an estimate of …
Table 3 Results of the Final Model. The table shows the point estimates of the covariate effects individually derived from the data in each of 12 weeks, together with their averages over 12 weeks shown in the last column
For Age and Gender, the averages of the estimates over the 12 weeks are -0.030 and -0.228, respectively, implying that one unit increase in Age is associated with an estimate of about 1 − exp(−0.030) ≈ 2.96% decrease in the odds of occurrence of mental health problems, and being a male relative to a female is associated with an estimate of 1 − exp(−0.228) ≈ 20.39% decrease in the odds of having mental health issues. Similarly, the 12-week estimated effects of Rhispanic indicate that being of Hispanic, Latino or Spanish origin is associated with smaller odds of having mental issues than others. The 12-week mean of the coefficient estimates of Rhispanic is -0.172, leading to an estimate of the odds of mental health problem occurrence being reduced by around 1 − exp(−0.172) ≈ 15.80%.
For the variable Race with White as the baseline, the 12-week mean of coefficient estimates for Black …
For predictors Numper and Numkid, the averages of the estimates suggest that an increase in the number of people and kids in the household is associated with a decrease in the odds of having mental issues. Specifically, one more person in the household is associated with an estimate of 1 − exp(−0.024) ≈ 2.37% decrease in the odds, and one more kid in the household is associated with an estimate of 1 − exp(−0.106) ≈ 10.06% decrease in the odds.
For the work-related factors Wrkloss and Anywork, the results shown in the last column in Table 3 indicate that experiencing a loss of employment income since March 13, 2020 is associated with an estimate of exp(0.352) − 1≈42.19% increase in the odds of having mental issues, and doing any work during the last 7 days is associated with an estimate of 1 − exp (−0.141)≈13.15% decrease in the odds.
The 12-week results of Foodconf in Table 3 show that, with the not at all confident on the future food affordability as the baseline, an increase in the confidence of food affordability is negatively associated with the odds of having mental issues. On average of 12 weeks, shown in the last column in Table 3, the more confident in the food affordability, the less the odds of having mental issues. For example, the person who is very confident (Foodconf4) in the food affordability for the next four weeks demonstrates an estimate of 1 − exp (−1.348)≈74.02% decrease in the odds of having mental issues, relative to the person who is not at all confident.
With excellent health conditions as the baseline, the estimates of Hlthstatus in Table 3 indicate that the worse the self-evaluated health condition, the larger the odds of having mental issues. Considering the worst level of health condition, poor (Hlthstatus5), as an example, the average of the estimates over the 12 weeks suggests that the estimated odds of having mental issues for people in poor health conditions are about exp(2.021) ≈ 7.55 times the odds for people in excellent health conditions. For the other health-related predictors, Healins and Med.delay.notget, people who are currently covered by health insurance are associated with an estimate of 1 − exp(−0.083) ≈ 7.96% decrease in the odds of mental issues occurrence, and people who do not get medical care or have delayed medical care are associated with an estimate of exp(0.684) − 1 ≈ 98.18% increase in the odds.
For Mort.prob and Schoolenroll, people having rental or mortgage problems are associated with an estimate of exp(0.232) − 1≈26.15% increase in the odds of having mental health problems, and people whose household has kids enrolled in school are associated with an estimate of exp(0.109) − 1≈11.52% increase in the odds of having mental issues.
In summary, the factors in the final model associated with a reduction in the odds of having mental health issues include: living in a State without large daily increases of cases, being older, being male, having a Hispanic, Latino or Spanish origin, being non-White, having more people or kids in the household, having a job during the last 7 days, having confidence in future food affordability, and being covered by health insurance. The factors in the final model associated with an increase in the odds of having mental issues are: not being married, experiencing a loss of employment income, poor self-evaluated health conditions, having problems in getting medical care or in paying rent or mortgage, and having kids enrolled in school.

Discussion
In this paper we investigate the impact of the COVID-19 pandemic on the public mental health using an online survey data set from the United States. Prior to the analysis, we pre-process the data by combining some levels of